
Qualcomm AI Engine Direct - Decouple quantization and compile graphs for faster VLM/LLM PTQ #19220

Open

DannyYuyang-quic wants to merge 1 commit into pytorch:main from CodeLinaro:dev1/danny/optimize_mllm_ptq

Conversation

@DannyYuyang-quic
Contributor

@DannyYuyang-quic DannyYuyang-quic commented Apr 30, 2026

Summary

  • Calibrate decoder using prefill stage only (full chunk tokens)
  • Remove need for AR-N calibration loops
  • Significantly reduce calibration overhead
| model name | before, time (sec) | after, time (sec) | speedup |
| --- | --- | --- | --- |
| gemma-2b | 1216 | 259 | 4.69× |
| gemma2-2b | 1827 | 382 | 4.78× |
| gemma3-1b | 907 | 218 | 4.16× |
| glm-1_5b | 963 | 167 | 5.76× |
| granite_3_3-2b | 1545 | 304 | 5.08× |
| llama3_2-1b | 1237 | 285 | 4.34× |
| llama3_2-3b | 2286 | 813 | 2.81× |
| phi_4_mini | 2824 | 363 | 7.77× |
| qwen2_5-0_5b | 486 | 119 | 4.08× |
| qwen2_5-1_5b | 1068 | 220 | 4.86× |
| qwen3-0_6b | 1013 | 158 | 6.41× |
| qwen3-1_7b | 1478 | 283 | 5.22× |
| smollm2_135m | 399 | 122 | 3.27× |
| smollm3-3b | 2065 | 431 | 4.79× |
| smolvlm_500m_instruct | 170 | 131 | 1.30× |
| internvl3_1b | 170 | 103 | 1.65× |
| granite_speech_3_3-2b | 447 | 215 | 2.07× |

This change decouples the quantization graph from the graph used for subsequent lowering, so calibration no longer depends on the AR-N decoding flow.

Previously, calibration ran directly on the graph shaped for lowering (with fixed AR-N constraints). That forced calibration into an autoregressive loop (AR-1 per step), which was slow and never exposed the full sequence context in a single pass, roughly:
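A minimal sketch of that old loop, for illustration only; `prepared_model`, its single-token signature, and the cache layout are assumptions, not the actual ExecuTorch Qualcomm API:

```python
import torch

def calibrate_ar1(prepared_model, input_ids, kv_cache):
    # The lowering-shaped graph accepts one token per step, so every
    # calibration sample must be replayed token by token (assumed signature).
    for pos in range(input_ids.shape[1]):
        token = input_ids[:, pos : pos + 1]  # shape [1, 1]
        _logits, kv_cache = prepared_model(token, kv_cache)
    return kv_cache
```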

With this update, calibration is done once during the prefill stage using the full token chunk. This gives much better coverage in a single run and completely removes the need for iterative decoding during calibration.
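A minimal sketch of the prefill-only calibration, assuming the standard PT2E flow (`prepare_pt2e`/`convert_pt2e`); the quantizer setup and export details are simplified and not the exact code in this PR:

```python
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e

def calibrate_prefill(exported_decoder, quantizer, calib_chunks):
    # `exported_decoder` is assumed to be exported with a full-sequence
    # (prefill) signature, e.g. input_ids of shape [1, chunk_len].
    prepared = prepare_pt2e(exported_decoder, quantizer)
    for input_ids in calib_chunks:
        prepared(input_ids)  # one forward pass covers the whole chunk
    return convert_pt2e(prepared)
```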

After quantization, we take the KV cache encodings from the output, override the input KV cache encodings, and then propagate those into the graph that will later be lowered. This keeps everything consistent without needing to recalibrate on that graph.
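A hedged sketch of that encoding hand-off; `collect_output_encodings` and `set_input_encoding` are hypothetical helpers standing in for whatever the backend uses to read and write per-tensor quantization parameters:

```python
def propagate_kv_encodings(quantized_prefill_gm, lowering_gm):
    # Read the (scale, zero_point) chosen for each KV-cache output of the
    # calibrated prefill graph (hypothetical helper)...
    for name, (scale, zp) in collect_output_encodings(quantized_prefill_gm).items():
        if "kv_cache" in name:
            # ...and override the matching KV-cache input encoding on the
            # graph that will be lowered, keeping producer and consumer
            # quant params consistent without recalibration.
            set_input_encoding(lowering_gm, name, scale, zp)  # hypothetical
```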

Result: same accuracy, significantly faster calibration, and a much cleaner separation between quantization and lowering.

Test plan

Test CI in TestExampleLLMScript and TestExampleMultimodalityScript

@pytorch-bot

pytorch-bot Bot commented Apr 30, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19220

Note: Links to docs will display an error until the docs builds have been completed.

⚠️ 11 Awaiting Approval

As of commit a447ba2 with merge base e84a418:

AWAITING APPROVAL - The following workflows need approval before CI can run:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label Apr 30, 2026
@DannyYuyang-quic
Contributor Author

Hi @abhinaykukkadapu,
This PR optimizes the PTQ calibration flow for VLM/LLM.
Details are in the top comment.
Please have a look!
Thanks!!

@DannyYuyang-quic
Contributor Author

@pytorchbot label "release notes: qualcomm"

@pytorch-bot pytorch-bot Bot added the release notes: qualcomm label Apr 30, 2026
Qualcomm AI Engine Direct - Decouple quantization and compile graphs for faster VLM/LLM PTQ

Summary:
  - Calibrate decoder using prefill stage only (full chunk input_ids)
  - Remove need for AR-N calibration loops
  - Significantly reduce calibration overhead

@DannyYuyang-quic DannyYuyang-quic force-pushed the dev1/danny/optimize_mllm_ptq branch from c3f07e0 to a447ba2 on April 30, 2026 18:49